Final: Effects of Air Pollution on Countries

1. Introduction

Motivation

Climate issues are pervasive but typically disproportionately affect low income communities and developing countries. Our group wanted to explore how air pollution has changed over time and affect countries differently. Specifically, we wanted to analyze how a country’s economic and social position can either increase, decrease, or not have observable impact on the affects of air pollution. In laymen terms, does air pollution affect underdeveloped countries disproportionately?

Set Up

Before we start, we need to ensure that we have all the relevant libraries installed and imported.

Run these in the console, or only the ones that your system does not have, to install packages in addition to the ezids package.

install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")
install.packages("knitr")
install.packages("magrittr")

2. Data Sources and Data Wrangling

Data Sources

For our analysis, we will be working with 5 main data sources shown in the table below:

Figure 1: Data Sources
Data Source Link
Deaths Due to Air Pollution of Countries from 1990 - 2017 Kaggle Link
GDP Annual Growth of Countries from 1960 - 2020 Kaggle via WorldBank Link
United Nations Population and Region Data United Nations Link
United Nations ISO-alpha3 code United Nations Link
spData for Map Geometries spData for Mapping Link

The main variables in our datasets will include:

Figure 2: Key Variables
Feature Data Type Unit of Measure Notes and Assumptions
GDP (Gross Domestic Product) Numerical, Continuous $USD This is our chosen proxy for measuring a country’s economic status
Population Size Numerical, Continuous thousands of people Annual UN estimated
Deaths due to Air Pollution Numerical, Continuous deaths per million This is our chosen proxy for measuring the negative affects of air pollution.
Country Qualitative, Categorical N/A 231 countries
SDG Region Qualitative, Categorical N/A UN’s Sustainable Development Goals Region Classification.
Sub Region Qualitative, Categorical N/A UN’s Sustainable Development Goals Sub-Region Classification.
ISO-alpha3 Country Code Qualitative, Categorical N/A Standard for identifying countries (text ID).
ISO-alpha2 Country Code Qualitative, Categorical N/A Another standard for identifying countries (text ID).
M49 Country Code Numerical, Categorical N/A Another standard for identifying countries (numerical ID).
Year Numerical, Categorical N/A 1990 to 2017
GDP per Capita Numerical, Continuous $USD per person Normalization of GDP to compare between population sizes (calculated).

Data Wrangling

While data from Kaggle are already in a format to be cleaned, downloaded data from United Nations required a little data wrangling. Mainly, we needed to extract just countries’ data from the Excel workbooks and into their own contained csv files. Since we only need to do this once and programming it would take significant time to choose the specific cells that we need, we opted to perform this step outside of R and in Excel. Note that if this were a part of a real production data pipeline, we would take the time to program the data extraction but would likely choose a different programming language such as Python that is a bit more robust in these types of tasks like web scraping and data transformations in Pandas.

UN Data Sample Messy
Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
UN Data Sample Cleaned
Figure 4: Sample screenshot of transformed UN dataset.

3. Load, Clean, and Inspect Data

Load Data

Figure 5: Structure of country_codes_df
variable class first_values
Country.or.Area character Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania
ISO.alpha2.code character AD, AE, AF, AG, AI, AL
ISO.alpha3.code character AND, ARE, AFG, ATG, AIA, ALB
M49.code integer 20, 784, 4, 28, 660, 8
Figure 6: Structure of air_pollution_df
variable class first_values
Entity character Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan
Code character AFG, AFG, AFG, AFG, AFG, AFG
Year integer 1990, 1991, 1992, 1993, 1994, 1995
Air.pollution..total…deaths.per.100.000. double 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243
Indoor.air.pollution..deaths.per.100.000. double 250.362909742375, 242.575124973334, 232.043877894811, 231.648133503794, 238.837176822107, 239.906598716878
Outdoor.particulate.matter..deaths.per.100.000. double 46.4465894382846, 46.0338405670284, 44.2437660321924, 44.4401481443785, 45.5943284100213, 45.3671411300974
Outdoor.ozone.pollution..deaths.per.100.000. double 5.61644203074918, 5.60396011603667, 5.61182206482564, 5.65526606275628, 5.71892222061506, 5.73917378233707
Figure 7: Structure of gdp_df, not showing all years in table to retain space but years go up to X2020
variable class first_values
Country.Name character Aruba, Afghanistan, Angola, Albania, Andorra, Arab World
Country.Code character ABW, AFG, AGO, ALB, AND, ARB
Indicator.Name character GDP (current US\(), GDP (current US\)), GDP (current US\(), GDP (current US\)), GDP (current US\(), GDP (current US\))
Indicator.Code character NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD
X1960 double NA, 537777811.111111, NA, NA, NA, NA
X1961 double NA, 548888895.555556, NA, NA, NA, NA
X1962 double NA, 546666677.777778, NA, NA, NA, NA
X1963 double NA, 751111191.111111, NA, NA, NA, NA
X1964 double NA, 800000044.444444, NA, NA, NA, NA
X1965 double NA, 1006666637.77778, NA, NA, NA, NA
Figure 8: Structure of population_region_df, not showing all years in table to retain space but years go up to X2020
variable class first_values
SDGRegion character SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA
SubRegion character Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa
Country character Burundi, Comoros, Djibouti, Eritrea, Ethiopia, Kenya
Notes integer NA, NA, NA, NA, NA, NA
Country.code integer 108, 174, 262, 232, 231, 404
Type character Country/Area, Country/Area, Country/Area, Country/Area, Country/Area, Country/Area
Parent.code integer 910, 910, 910, 910, 910, 910
X1950 character 2 309, 159, 62, 822, 18 128, 6 077
X1951 character 2 360, 163, 63, 835, 18 467, 6 242
X1952 character 2 406, 167, 65, 849, 18 820, 6 416
Figure 9: Structure of world, not showing geom feature in table as it has unique list of values per row and therefore is extremely large to display.
variable class first_values
iso_a2 character FJ, TZ, EH, CA, US, KZ
name_long character Fiji, Tanzania, Western Sahara, Canada, United States, Kazakhstan
continent character Oceania, Africa, Africa, North America, North America, Asia
region_un character Oceania, Africa, Africa, Americas, Americas, Asia
subregion character Melanesia, Eastern Africa, Northern Africa, Northern America, Northern America, Central Asia
type character Sovereign country, Sovereign country, Indeterminate, Sovereign country, Country, Sovereign country
area_km2 double 19289.9707329765, 932745.792357074, 96270.6010408472, 10036042.9767873, 9510743.74482458, 2729810.51298781
pop double 885806, 52234869, NA, 35535348, 318622525, 17288285
lifeExp double 69.96, 64.163, NA, 81.9530487804878, 78.8414634146341, 71.62
gdpPercap double 8222.25378436842, 2402.09940362843, NA, 43079.1425247165, 51921.9846391384, 23587.3375151466

Clean Data

First thing that we need to drop unnecessary columns and set datatypes (factor, num, etc.).

Clean air_pollution_df:

Figure 10: Structure of air_pollution_df_cleaned
variable class first_values
Country integer Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan
ISO.alpha3.code integer AFG, AFG, AFG, AFG, AFG, AFG
Year integer 1990, 1991, 1992, 1993, 1994, 1995
Deaths.Air.Pollution.per.100k double 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243

Clean gdp_df:

Figure 11: Structure of gdp_df_cleaned
variable class first_values
Country integer Aruba, Aruba, Aruba, Aruba, Aruba, Aruba
ISO.alpha3.code integer ABW, ABW, ABW, ABW, ABW, ABW
Year integer 1986, 1987, 1988, 1989, 1990, 1991
GDP.USD double 405463417.11746, 487602457.746416, 596423607.114715, 695304363.031101, 764887117.194486, 872138715.083799

Clean population_region_df:

Figure 12: Structure of population_region_df_cleaned
variable class first_values
SDGRegion integer SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA
SubRegion integer EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica
Country integer Burundi, Burundi, Burundi, Burundi, Burundi, Burundi
M49.code integer 108, 108, 108, 108, 108, 108
Year integer 1950, 1951, 1952, 1953, 1954, 1955
Population.thousands double 2309, 2360, 2406, 2449, 2492, 2537

Clean population_region_df:

Figure 13: Structure of country_codes_df
variable class first_values
Country.or.Area integer Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania
ISO.alpha2.code integer AD, AE, AF, AG, AI, AL
ISO.alpha3.code integer AND, ARE, AFG, ATG, AIA, ALB
M49.code integer 20, 784, 4, 28, 660, 8

Clean world:

Figure 14: Structure of world_df_cleaned, not showing geom feature in table as it has unique list of values per row and therefore is extremely large to display.
variable class first_values
iso_a2 integer FJ, TZ, EH, CA, US, KZ

Note that we only have geometries for 175 countries, some will not be able to be plot on a map but that is okay.

Final DataFrame Construction

Now let’s merge our 4 datasets into one using a series of inner joins using country code and year as keys depending on the specific join. We are using inner joins because we want to drop all null values which would mean either a country does not have a country code or we have more years of data than our smallest year range (the air pollution dataset).

Figure 15: Structure of final_df
variable class first_values
ISO.alpha2.code integer AD, AD, AD, AD, AD, AD
M49.code integer 20, 20, 20, 20, 20, 20
Year integer 2012, 2013, 1990, 1991, 1992, 1993
ISO.alpha3.code integer AND, AND, AND, AND, AND, AND
Country.x integer Andorra, Andorra, Andorra, Andorra, Andorra, Andorra
Deaths.Air.Pollution.per.100k double 17.6754871826169, 17.1893417774086, 29.0238806202567, 28.6956788863825, 28.4603211317312, 27.8408717612189
GDP.USD double 3188808942.56713, 3193704343.20627, 1029048481.88051, 1106928582.86629, 1210013651.87713, 1007025755.00065
SDGRegion integer EUROPE, EUROPE, EUROPE, EUROPE, EUROPE, EUROPE
SubRegion integer SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope
Population.thousands double 82, 81, 55, 57, 59, 61
geom list list(), list(), list(), list(), list(), list()
gdp.per.capita double 38887.9139337455, 39428.4486815589, 18709.9723978275, 19419.7996994086, 20508.7059640192, 16508.6189344369

Our dataset is finally ready to be analyzed.

4. EDA Summary

All necessary EDA was performed in the Midterm Report where we looked at distributions for each key numerical variable in Air Pollution Induced Deaths per 100k, Population, GDP per Capita. We also looked into boxplots of these key numerical variables by Regions and Sub Regions. Interestingly, we observed that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively.

Next, we checked for outliers and found that Deaths.Air.Pollution.per.100k definitely do not need to have outliers removed as it only represents 0.7% of the dataset. gdp.per.capita have higher percentage of classified outliers at 13.5%, however, we felt that keeping the extreme datapoints of this feature was important in our research and analysis to capture the true disproportionate distribution of wealth and progress across countries in our world.

Lastly, we explored a few choropleth maps to visually understand our dataset better.

We intentionally did not include any figures from EDA in this final report to preserve space. Please look at the Midterm Report for details on our entire EDA process.

5. Main Research Question

Below is a rehash of our main research question from the Midterm report and we felt that the findings from this model was important enough to warrant inclusion in this final report as well.

Do lower GDP countries have more deaths per 100k due to air pollution?

Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?

Linear Fit

Let’s first look at the general fit on the overall data.

Figure 15: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.

From the plot, we observed that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the strength of that relationship is not particularly strong as the R2 is really low at 0.295. This means that only 29% of the variance experienced in deaths due to air pollution per 100k is caused by GDP per capita in a linear relationship.

Transformed Log Scale - Linear Fit

We then performed a log transformation and found that our linear regression fits much better.

fit2’s summary statistics are:

## 
## Call:
## lm(formula = log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita), 
##     data = final_df_sf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.099 -0.235  0.000  0.206  1.431 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)          7.38778    0.02663     277 <0.0000000000000002 ***
## log(gdp.per.capita) -0.38952    0.00323    -121 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.369 on 5195 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.737 
## F-statistic: 1.45e+04 on 1 and 5195 DF,  p-value: <0.0000000000000002

We re-plotted the linear model and here were the results.

Figure 16: Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship.

Across the board, the strength of our linear relationship increases dramatically when first transforming both features by the log() function first. The new R2 is now 0.737 which means around 74% of the variance in our target feature can be explained by this mathematical relationship.

We can then predict a country’s deaths caused from air pollution in a given year by using the country’s GDP per capita with the following equation:

\[ log(Deaths_{from~air~pollution|per~year|per~country} / 100,000) = 7.38778 - 0.38952 * log(GDP_{per capita}) ~~~~~~~~~~~~~~~~ eqn (1) \]

or solving for our target variable:

\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{7.38778 - 0.38952 * log(GDP per capita)} * 100,000 ~~~~~~~~~~~~~~~~ eqn (2) \]

Is there a difference in means of death caused by pollution between low, mid, and high GDP per capita?

We all know that correlation does not necessarily mean causation. Let us dig a little deeper and test if means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.

One-Way ANOVA Test

We started off by performing a One-Way ANOVA test to determine if the means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.

H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp

H1: At least one of \(\mu\)deaths_lowest_gdp, \(\mu\)deaths_low_gdp, \(\mu\)deaths_medium_gdp, \(\mu\)deaths_high_gdp is not equal

We will use an \(\alpha\) value of 0.05.

## $`1`
## [1]  95.2 920.3
## 
## $`2`
## [1]  921 3100
## 
## $`3`
## [1]  3104 11493
## 
## $`4`
## [1]  11497 119106
##               Df   Sum Sq Mean Sq F value              Pr(>F)    
## qnt            3 10735714 3578571    3133 <0.0000000000000002 ***
## Residuals   5193  5932162    1142                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-valuetest1 is 0e+00, which is lower than \(\alpha\)0.05. Therefore, we reject our null hypothesis that \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp. This means that there is statistically significant that at least one of the means of deaths in low, medium, and high GDP per capita are not the same.

2-Sample T-Tests

We will conduct 6 2-sample t-tests to determine if each of the groupings are different from each other:

  • Lowest GDP per capita’s deaths does not equal Low GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
  • Low GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
    • H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp
    • H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_medium_gdp
  • Medium GDP per capita’s deaths does not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_medium_gdp != \(\mu\)deaths_high_gdp
  • Lowest GDP per capita’s deaths does not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_high_gdp
  • Lowest GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_medium_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_medium_gdp
  • Low GDP per capita’s deaths does not equal Highest GDP per capita’s deaths
    • H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_high_gdp

We used a two sample t-test for each and used an \(\alpha\) value of 0.05.

Test 1:

Conclusion of test1: p-valuetest1 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_low_gdp and accept our alternative hypothesis.

Test 2:

Conclusion of test2: p-valuetest2 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.

Test 3:

Conclusion of test3: p-valuetest3 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_medium_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Test 4:

Conclusion of test4: p-valuetest4 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Test 5:

Conclusion of test5: p-valuetest5 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.

Test 6:

Conclusion of test6: p-valuetest6 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Midterm Main Research Results

From all of our tests above, we can confirm that the means of deaths caused by air pollution are statistically significant when grouped by different levels of GDP per capita. This reinforces the idea that deaths caused by air pollution has a significant relationship with GDP per capita and the model can be quantified by Equation 2:

\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{7.38778 - 0.38952 * log(GDP per capita)} * 100,000 ~~~~~~~~~~~~~~~~ eqn (2) \]

The strength of the correlation can be quantified by our R2 value of 0.737 from Figure 16.

6. Smart Questions for Further Modeling

1. What are the impacts of population size from GDP and Deaths due to Air Pollution globally?

  1. Categorizing GDP into low (0), medium(1) and high(2)
SQ6 <- final_df
library(dplyr)
groupings_pop <- SQ6 %>% mutate(population_grouping = case_when(Population.thousands >=  100001 & Population.thousands <= 1000000000 ~ 2,
                                             Population.thousands >= 50001  & Population.thousands <= 100001 ~ 1,
                                             Population.thousands >= 0 & Population.thousands <= 50000 ~ 0)) # end function
  1. Building a logit model GDP as predictor on Population
pop_gdpLogit <- glm(population_grouping ~ groupings_pop$gdp.per.capita, data = groupings_pop)
summary(pop_gdpLogit)
## 
## Call:
## glm(formula = population_grouping ~ groupings_pop$gdp.per.capita, 
##     data = groupings_pop)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.246  -0.185  -0.181  -0.180   1.820  
## 
## Coefficients:
##                                 Estimate  Std. Error t value
## (Intercept)                  0.180177461 0.008492388   21.22
## groupings_pop$gdp.per.capita 0.000000553 0.000000456    1.21
##                                         Pr(>|t|)    
## (Intercept)                  <0.0000000000000002 ***
## groupings_pop$gdp.per.capita                0.23    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.267)
## 
##     Null deviance: 1389.8  on 5196  degrees of freedom
## Residual deviance: 1389.4  on 5195  degrees of freedom
## AIC: 7899
## 
## Number of Fisher Scoring iterations: 2
  1. Bulding a logit model GDP and Deaths due to Air Population as predictor on Population
pop_gdp_deathsLogit <- glm(population_grouping ~ groupings_pop$gdp.per.capita + groupings_pop$Deaths.Air.Pollution.per.100k, data = groupings_pop)
summary(pop_gdp_deathsLogit)
## 
## Call:
## glm(formula = population_grouping ~ groupings_pop$gdp.per.capita + 
##     groupings_pop$Deaths.Air.Pollution.per.100k, data = groupings_pop)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.215  -0.194  -0.186  -0.169   1.842  
## 
## Coefficients:
##                                                 Estimate   Std. Error t value
## (Intercept)                                  0.202910224  0.018188095   11.16
## groupings_pop$gdp.per.capita                 0.000000136  0.000000543    0.25
## groupings_pop$Deaths.Air.Pollution.per.100k -0.000213150  0.000150811   -1.41
##                                                        Pr(>|t|)    
## (Intercept)                                 <0.0000000000000002 ***
## groupings_pop$gdp.per.capita                               0.80    
## groupings_pop$Deaths.Air.Pollution.per.100k                0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.267)
## 
##     Null deviance: 1389.8  on 5196  degrees of freedom
## Residual deviance: 1388.9  on 5194  degrees of freedom
## AIC: 7899
## 
## Number of Fisher Scoring iterations: 2

2. What are the effects of (low or high) GDP and population size on deaths due to air pollution in Sub-Saharan Africa?

3. Can we let the data tell us what type of groupings exist in our dataset? How consistent are they to our preconceived groupings such as region or developed vs developing countries?

We wanted to better understand what types of groupings exist within our dataset. More concretely, we wanted to put aside our preconceived presumptions about our dataset and create a K-Means clustering model and allow the the data to guide us to the possible clusters that may exist within our data.

K-Means clustering is an unsupervised machine learning algorithm that is very useful to parse the dataset and identify K number of groups where observations within each group have high similarity.

First, let’s create an index label.

Figure 6.3-0: Head of final_df_cluster
index_ region Deaths.Air.Pollution.per.100k gdp.per.capita Population.thousands
Australia Australia AUSTRALIA/NEWZEALAND 17.8 35412 20283
New Zealand New Zealand AUSTRALIA/NEWZEALAND 15.9 25279 4057
Afghanistan Afghanistan CENTRALANDSOUTHERNASIA 227.8 430 29282
Bangladesh Bangladesh CENTRALANDSOUTHERNASIA 144.5 620 133736
Bhutan Bhutan CENTRALANDSOUTHERNASIA 115.9 1408 627
India India CENTRALANDSOUTHERNASIA 165.9 838 1115445

Next, let’s select only the numerical values from our dataset.

Figure 6.3-1: Structure of final_df_numerical
variable class first_values
Deaths.Air.Pollution.per.100k double 17.768147178761, 15.9253637864585, 227.765254524006, 144.513734731741, 115.896472640701, 165.913668178177
gdp.per.capita double 35411.5137067032, 25278.8349742491, 430.287787933406, 619.590316576761, 1408.24273427377, 838.320549169855
Population.thousands double 20282.8928571429, 4056.67857142857, 29282, 133735.607142857, 627, 1115445.42857143

Optimal K Clusters

Let’s try to find the optimal K value to achieve best clustering results. The 2 that we will try are using:

  1. Elbow Method with Total Within Sum of Squares
  2. Looking at the Gap Statistics

Elbow Method with Total Within Sum of Squares

Figure 6.3-2: Finding the optimal number of clusters using the Within Sum of Squares Method.

From the total Within Sum of Squares plot, we find that the optimal number of K seems to be at 4 using the Elbow method which identifies the K number where the Total Sum of Squares begins to level off, or where adding additional K clusters only improves our model marginally.

Looking at the Gap Statistics

Figure 6.3-3: Finding the optimal number of clusters using the Gap Statistic.

From the Gap Statistic plot, we find that the optimal number of K seems to be at 10 with the highest gap statistic, however, we observed using the Elbow method that the Total Sum of Squares begins to level off from 4 to 10. We will stick with using K = 4.

Perform K-Means Clustering with Optimal K

We perform a K-Means clustering analysis with a K of 4 and here are the results below.

Figure 6.3-4: K-Means Clustering Average Values per Cluster
Cluster Deaths.Air.Pollution.per.100k gdp.per.capita Population.thousands
1 1.16234711051454 -0.61738173258503 -0.101796957671015
2 1.01091758337588 -0.564333841794921 9.25163623841697
3 -1.04092828247631 1.92017020504228 -0.0578237235004342
4 -0.494528110744794 -0.253250480051675 -0.107965219042902
Figure 6.3-5: Number of Observations Within Each Cluster
Cluster Observations with Each Cluster
1 67
2 2
3 34
4 90
Figure 6.3-6: Total Sum of Squares Within Each Cluster
Cluster Total Sum of Squares with Each Cluster
1 30.88
2 1.52
3 34.88
4 26.96

Let’s take a look at clusters visually. Interestingly, our model has grouped China and India into their own separate cluster, Cluster 2. Clusters 4, 3, and 1 can be roughly characterized by high GDP per captita nations, medium GDP per capita nations, and low GDP per capita nations’.

Figure 3.6-7: Visualing K-Means Clustering Model.

Let’s take a closer look at each cluster’s breakdown of subregions more closely to determine what type of discrepancies we find between each cluster.

Figure 6.3-8: Frequency % of SDGRegion by Clusters.

Interestingly, we observe that Cluster 1 is disproportionately comprised of countries in Sub-Saharan Africa with large representations from Oceania, Eastern & South-Eastern Asia, and Central & Southern Asia. Cluster 2 is the one cluster comprised solely of China and India. Cluster 3 has a much larger proportion of countries from Europe, Northen America, and Northern Africa & Western Asia. Cluster 4 is similar to Cluster 3 except it does not have countries from Northern America and has more countries from Sub-Saharan Africa and Central & Southern Asia.

4. Can we make a prediction of the future GDP by considering the population size, location, and the reation of population and deaths due to air pollution?

For our major model we make a nice prediction for GDP through total deaths caused by air pollution, so as we can see that through the distribution of the data, there are strong relationships between total deaths caused by air pollution, deaths caused by indoor air pollution, and deaths caused by outdoor air pollution.

# select what we need
# total
#ggplot(total3, aes(x=Year, y=air_pollution_total)) +
#    geom_bar(stat='identity', position='dodge')

# indoor
#ggplot(total3, aes(x=Year, y=air_pollution_indoor)) +
#    geom_bar(stat='identity', position='dodge')
# outdoor particulate
#ggplot(total3, aes(x=Year, y=air_pollution_outdoorP)) +
#    geom_bar(stat='identity', position='dodge')
# outdoor ozone 
#ggplot(total3, aes(x=Year, y=air_pollution_outdoorO)) +
#    geom_bar(stat='identity', position='dodge')


air_pollution_df_cleaned1 <- air_pollution_df

So it seems like we can also get a nice GDP prediction by using deaths caused by indoor air pollution and deaths caused by outdoor air pollution as the basis of the prediction model.

# make Entity and year factor
air_pollution_df_cleaned1$Entity = factor(air_pollution_df_cleaned1$Entity)
air_pollution_df_cleaned1$Code = factor(air_pollution_df_cleaned1$Code)
air_pollution_df_cleaned1$Year = factor(air_pollution_df_cleaned1$Year)
# polution
air_pollution_df_cleaned1 <- rename(air_pollution_df_cleaned1, Country = Entity, ISO.alpha3.code = Code, Deaths.Air.Pollution.per.100k = Air.pollution..total...deaths.per.100.000. , Deaths.Air.Pollution.Indoor.per.100k = Indoor.air.pollution..deaths.per.100.000., Deaths.Air.Pollution.OutdoorP.per.100k = Outdoor.particulate.matter..deaths.per.100.000., Deaths.Air.Pollution.OutdoorO.per.100k = Outdoor.ozone.pollution..deaths.per.100.000.)


# population
#population_region_df_cleaned <- subset(population_region_df, select = -c(Notes, Type, Parent.code))

#population_region_df_cleaned <- population_region_df_cleaned %>%
#  pivot_longer(
#    cols = starts_with("X"), 
#    names_to = "Year", 
 #   values_to = "Population.thousands", 
  #  values_drop_na = TRUE
#  )

# population_region_df_cleaned$Year<-gsub("X","",as.character(population_region_df_cleaned$Year))
# population_region_df_cleaned <- as.data.frame(apply(population_region_df_cleaned, 2, function(x) gsub("\\s+", "", x))) 

# population_region_df_cleaned$Country = factor(population_region_df_cleaned$Country)
# population_region_df_cleaned$SDGRegion = factor(population_region_df_cleaned$嚜燙DGRegion)
# population_region_df_cleaned$Country.code = factor(population_region_df_cleaned$Country.code)
# population_region_df_cleaned$SubRegion = factor(population_region_df_cleaned$SubRegion)
# population_region_df_cleaned$Year = factor(population_region_df_cleaned$Year)
# population_region_df_cleaned$Population.thousands = as.numeric(population_region_df_cleaned$Population.thousands)

# population_region_df_cleaned <- rename(population_region_df_cleaned, M49.code = Country.code)

# country_codes_df$M49.code = factor(country_codes_df$M49.code)
# country_codes_df$Country.or.Area = factor(country_codes_df$嚜澧ountry.or.Area)
# country_codes_df$ISO.alpha3.code = factor(country_codes_df$ISO.alpha3.code)
# country_codes_df$ISO.alpha2.code = factor(country_codes_df$ISO.alpha2.code)

# world$iso_a2 = factor(world$iso_a2)
# 
# world_df_cleaned <- subset(world, select = c(iso_a2, geom))

# merge
final_dfc <- merge(x = air_pollution_df_cleaned1, y = country_codes_df, by = 'ISO.alpha3.code')

final_dfc <- merge(x = final_dfc, y = gdp_df_cleaned, by = c('ISO.alpha3.code', 'Year'))
final_dfc <- merge(x = final_dfc, y = population_region_df_cleaned, by = c('M49.code', 'Year'))
final_dfc <- merge(x = final_dfc, y = world_df_cleaned, by.x = 'ISO.alpha2.code', by.y = 'iso_a2', all.x = TRUE) #left join

# Remove columns
final_dfc <- subset(final_dfc, select = -c(Country.or.Area, Country.y, Country))

# Calculate GDP per capita 
final_dfc$gdp.per.capita <- final_dfc$GDP.USD / final_dfc$Population.thousands


final_dfc$Deaths.Air.Pollution.Outdoor.Total.per.100k <- final_dfc$Deaths.Air.Pollution.OutdoorO.per.100k + final_dfc$Deaths.Air.Pollution.OutdoorP.per.100k

For the first model, we build a linear model using ‘gdp.per.capita’ and ‘Deaths.Air.Pollution.Indoor.per.100k’.

final_df_sfc = st_as_sf(final_dfc)
fit2c <- lm(gdp.per.capita ~Deaths.Air.Pollution.Indoor.per.100k, data=final_df_sfc)

summary(fit2c)
## 
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Indoor.per.100k, 
##     data = final_df_sfc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -15508268  -8789754  -3652222   3500260 102785447 
## 
## Coefficients:
##                                      Estimate Std. Error t value
## (Intercept)                          16339920     254681    64.2
## Deaths.Air.Pollution.Indoor.per.100k  -126647       3308   -38.3
##                                                 Pr(>|t|)    
## (Intercept)                          <0.0000000000000002 ***
## Deaths.Air.Pollution.Indoor.per.100k <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13900000 on 5195 degrees of freedom
## Multiple R-squared:  0.22,   Adjusted R-squared:  0.22 
## F-statistic: 1.47e+03 on 1 and 5195 DF,  p-value: <0.0000000000000002
ggplot(final_df_sfc, aes(gdp.per.capita, Deaths.Air.Pollution.Indoor.per.100k)) +
  geom_point() +
  geom_smooth(method='lm', se=FALSE) +
  stat_regline_equation(label.y = 350, aes(label = ..eq.label..)) +
  stat_regline_equation(label.y = 400, aes(label = ..rr.label..))
Figure 6.4-1: Deaths due to Indoor Air Pollution vs GDP per capita.

The R-squared value of this model is 0.22, which is a really bad result. So follow the steps in the major model building, now we put these two features into a log() and it should performed better.

## 
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Indoor.per.100k), 
##     data = final_df_sfc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.354 -0.599  0.098  0.647  2.892 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                               16.33257    0.01763     927
## log(Deaths.Air.Pollution.Indoor.per.100k) -0.54761    0.00517    -106
##                                                      Pr(>|t|)    
## (Intercept)                               <0.0000000000000002 ***
## log(Deaths.Air.Pollution.Indoor.per.100k) <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.891 on 5195 degrees of freedom
## Multiple R-squared:  0.684,  Adjusted R-squared:  0.684 
## F-statistic: 1.12e+04 on 1 and 5195 DF,  p-value: <0.0000000000000002
Figure 6.4-2: Log Deaths due to Indoor Air Pollution vs Log GDP per capita.

As we can see, although it performed much better with a R-squared value 0.68 but this is still not good as the major model.

Next we try the deaths caused by outdoor air pollution ‘Deaths.Air.Pollution.Outdoor.Total.per.100k’.

## 
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Outdoor.Total.per.100k, 
##     data = final_df_sfc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14051236  -9537840  -5335898   2518144 106063637 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 15643215     461880    33.9
## Deaths.Air.Pollution.Outdoor.Total.per.100k  -150382      10830   -13.9
##                                                        Pr(>|t|)    
## (Intercept)                                 <0.0000000000000002 ***
## Deaths.Air.Pollution.Outdoor.Total.per.100k <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15400000 on 5195 degrees of freedom
## Multiple R-squared:  0.0358, Adjusted R-squared:  0.0356 
## F-statistic:  193 on 1 and 5195 DF,  p-value: <0.0000000000000002
Figure 6.4-3: Deaths due to Outdoor Air Pollution vs GDP per capita.

The performence of this model is also bad, with a R-squared value 0.036, so we put the features in the log() too.

## 
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Outdoor.Total.per.100k), 
##     data = final_df_sfc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.591 -1.272 -0.027  1.297  3.471 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                       15.6988     0.1565  100.28
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k)  -0.1991     0.0442   -4.51
##                                                              Pr(>|t|)    
## (Intercept)                                      < 0.0000000000000002 ***
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k)            0.0000068 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 5195 degrees of freedom
## Multiple R-squared:  0.00389,    Adjusted R-squared:  0.0037 
## F-statistic: 20.3 on 1 and 5195 DF,  p-value: 0.00000675
Figure 6.4-4: Log Deaths due to Outdoor Air Pollution vs Log GDP per capita.

A surprised point is that although we put the variables into log(), the performance of model is still bad and even worse. The R-squared value of this model is 0.0039.

Finally, we are going to test build a model base on both deaths caused by outdoor and indoor air pollution. To check if this model also works bad.

## 
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Indoor.per.100k + 
##     Deaths.Air.Pollution.Outdoor.Total.per.100k, data = final_df_sfc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -18198746  -7911990  -2886359   3411956  96919910 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 26403084     457576    57.7
## Deaths.Air.Pollution.Indoor.per.100k         -144489       3190   -45.3
## Deaths.Air.Pollution.Outdoor.Total.per.100k  -242555       9394   -25.8
##                                                        Pr(>|t|)    
## (Intercept)                                 <0.0000000000000002 ***
## Deaths.Air.Pollution.Indoor.per.100k        <0.0000000000000002 ***
## Deaths.Air.Pollution.Outdoor.Total.per.100k <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13100000 on 5194 degrees of freedom
## Multiple R-squared:  0.309,  Adjusted R-squared:  0.309 
## F-statistic: 1.16e+03 on 2 and 5194 DF,  p-value: <0.0000000000000002
Figure 6.4-5: Deaths due to Indoor + Outdoor Air Pollution vs GDP per capita.

The R-squared value of this model is 0.29. Next we put these variables into log().

## 
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Outdoor.Total.per.100k) + 
##     log(Deaths.Air.Pollution.Indoor.per.100k), data = final_df_sfc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.200 -0.594  0.055  0.621  2.901 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                      17.60605    0.08820   199.6
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -0.35985    0.02444   -14.7
## log(Deaths.Air.Pollution.Indoor.per.100k)        -0.55212    0.00507  -108.8
##                                                             Pr(>|t|)    
## (Intercept)                                      <0.0000000000000002 ***
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) <0.0000000000000002 ***
## log(Deaths.Air.Pollution.Indoor.per.100k)        <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.873 on 5194 degrees of freedom
## Multiple R-squared:  0.696,  Adjusted R-squared:  0.696 
## F-statistic: 5.96e+03 on 2 and 5194 DF,  p-value: <0.0000000000000002
Figure 6.4-6: Log Deaths due to Indoor + Outdoor Air Pollution vs Log GDP per capita.

For this model the R-squared value is 0.69. This is actually not a bad result but when comparing to the major model using total deaths caused by air pollution, 0.69 is not good enough for predicting GDP.

We can get the conclusion that we can not get a nice GDP prediction by using deaths caused by indoor air pollution and deaths caused by outdoor air pollution as the basis of the prediction model.

7. Conclusion SMART Modeling

  1. For the Logit Regression, we observed small p-values indicating that all the coefficients are found to be significant. GDP has a positive effect on population, but deaths due to air pollution have a negative effect on population.

  2. As shown above, post-pruning, our accuracy is still 74%. So, pruning had little effect although the tree has become simpler. Therefore, we have a model that can predict high deaths due to air pollution with reasonable (74%) accuracy.

  3. For the clustering, it is really interesting to observe that the main clusters 1, 3, and 4 have large discrepancies in regional representation while the data we used for clustering did not have any explicit geospatial components. This can mean that regionality, although may not be the cause, does have some correlation in determining the groups of factors in GDP per Capita, Population, and Deaths due to Air Pollution per 100k.

  4. Although death caused by indoor air pollution and outdoor air pollution seems to have a strong relationship with total death caused by air pollution, but we can’t accurately predict GDP through death caused by indoor and outdoor air pollution.

8. Bibliography

Figure 8.1: References
Number APA Citation
1 Robin Lovelace, J. N. (n.d.). Chapter 8 Making maps with R: Geocomputation with R. Retrieved October 28, 2021, from https://geocompr.robinlovelace.net/adv-map.html
2 Robin Lovelace, J. N. (2021, October 28). Chapter 2 Geographic data in R: Geocomputation with R. Retrieved from https://geocompr.robinlovelace.net/spatial-class.html#intro-sf
3 Hadley Wickham, D. N. (2021, October 28). 6 Maps. Retrieved from https://ggplot2-book.org/maps.html
4 Customizing ggplot2 color and fill scales. (2021, October 28). Retrieved from https://spielmanlab.github.io/introverse/articles/color_fill_scales.html
5 Logarithmic Functions. (2021, October 28). Retrieved from https://saylordotorg.github.io/text_intermediate-algebra/s10-03-logarithmic-functions-and-thei.html
6 K-Means Clustering in R. (2021, December 3). Retrieved from https://www.statology.org/k-means-clustering-in-r/